Abstract

Background: The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of 
genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is 
based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full 
potential of this SNP discovery approach. 

Results: In this study, we formulate two new combinatorial optimization problems. While both problems are 
aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over 
different candidate sequence spaces. The first problem, denoted as SNP-MS_P, limits its search to sequences
whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast,
the second problem, denoted as SNP-MS_Q, limits its search to sequences whose in silico predicted mass spectra
instead contain all the signals of the measured mass spectra. We present an exact dynamic programming
algorithm for solving the SNP-MS_P problem and also show that the SNP-MS_Q problem is NP-hard by a
reduction from a restricted variation of the 3-partition problem.

Conclusions: We believe that an efficient solution to either problem above could offer a seamless integration of 
information in four complementary base-specific cleavage reactions, thereby improving the capability of the 
underlying biotechnology for sensitive and accurate SNP discovery. 




Background 

Single nucleotide polymorphisms (SNPs) are a common
type of DNA sequence variation that occurs when a single
nucleotide base is altered at a specific locus. They are
among the most important genetic factors that contribute 
to human disease and biological functions. However, dis- 
covering novel SNPs is a scientifically challenging task. 
Among others, one valuable approach proposed for SNP 
discovery is based on base-specific cleavage and mass 
spectrometry [1-3]. 

The SNP discovery approach based on base-specific 
cleavage and mass spectrometry usually adopts a data-acquisition
procedure as summarized below. First, a
target sample DNA sequence is PCR-amplified using primers
that incorporate the T7 promoter sequences. Then,
the PCR products are in vitro transcribed and subsequently
digested with the endonuclease RNase A in four
base-specific cleavage reactions. Each reaction can cleave 
the sample sequence to completion at all loci wherever a 
specific base is found. Finally, the matrix-assisted laser 
desorption/ionization time-of-flight mass spectrometry 
(MALDI-TOF MS) is applied to the cleavage products, 
resulting in four measured mass spectra, each corre- 
sponding to one base-specific cleavage reaction. 

Since each cleavage product is expected to be made of 
three non-cleavage bases, it is fairly straightforward to 
calculate the base composition from its measured mass 
signal. With all these base compositions in hand, the task
of discovering SNPs in the sample sequence is now left to
a computational solution. In principle, this computational 
solution shall find a way to integrate the four comple- 
mentary base-specific mass spectra, and then identify 
those SNPs that necessarily account for the unanticipated 
base compositions (i.e., corresponding to the measured 
mass signal changes as compared with the in silico
predicted mass spectra from a reference sequence).
See Figure 1 for a schematic outline of the SNP discovery
approach using base-specific cleavage and mass 
spectrometry. 

The early proof-of-concept studies on the above SNP 
discovery approach using base-specific cleavage and mass 
spectrometry were presented in [3-5], where the identification
of SNPs, however, was done by visual inspection.
Shortly afterwards, two automated computational solutions
were developed [1,2]: one was implemented in the
proprietary MassARRAY™ SNP Discovery software package
from Sequenom, Inc., and the other in
the software package RNaseCut, which is
freely available online [6]. In particular, the solution in [1]
comprises two separate procedures. It first computes
all potential SNPs that give rise to each unanticipated
base composition and then scores them by taking
into account the mass spectrometry data from the four
base-specific cleavage reactions. Thus, the integration of
the four base-specific cleavage reactions is done only in
the second step. Such an integration strategy
is far from optimal, since the first step implicitly assumes
that the occurrences of potential SNPs are independent.

In this paper, we study two new combinatorial optimi- 
zation problems to exploit the full potential of the above 
SNP discovery approach. While both problems are aimed 
at reconstructing the sample sequence that would attain 
the minimum number of SNPs, they search over different
candidate sequence spaces. The first problem, denoted as
SNP-MS_P, limits its search to sequences whose in silico
predicted mass spectra have all their signals contained in
the measured mass spectra. In contrast, the second problem,
denoted as SNP-MS_Q, limits its search to
sequences whose in silico predicted mass spectra instead
contain all the signals of the measured mass spectra.
Then, we present an exact dynamic programming algorithm
for solving the SNP-MS_P problem and also show
that the SNP-MS_Q problem is NP-hard by a reduction
from the restricted variation of the 3-partition problem
[7,8].

Methods 

Preliminaries 

Let $s \in \Sigma^*$ denote a string over the four-base alphabet
$\Sigma = \{A, C, G, T\}$. The length of $s$ is denoted by $|s|$, the
$i$-th base of $s$ by $s[i]$, and the substring of $s$ from the $i$-th
base to the $j$-th base by $s[i, j]$, for $1 \le i \le j \le |s|$. We use $\epsilon$
to denote the empty string so that $|\epsilon| = 0$. The concatenation
of two strings $s$ and $t$ is denoted by $s \cdot t$, and the
concatenation of $l$ copies of a string $s$ is denoted by $s^l$.

Given a string $s$ and a cut base $x \in \Sigma$, a cleavage fragment
refers to a substring of $s$ that does not contain $x$
and that cannot be extended on either side without
crossing a base $x$. Formally, the substring $s[i, j]$ is a cleavage
fragment with respect to the cut base $x$ if the following
three conditions are satisfied: (i) $s[i - 1] = x$ if
$i \ne 1$, (ii) $s[j + 1] = x$ if $j \ne |s|$, and (iii) $s[k] \ne x$, $\forall k \in [i, j]$.
In addition, the empty string $\epsilon$ is a cleavage fragment if
there exists $i \in [1, |s| - 1]$ such that $s[i] = s[i + 1] = x$.
Given a cleavage fragment, we use $A_iC_jG_kT_l$ to denote its
base composition of $i$ As, $j$ Cs, $k$ Gs, and $l$ Ts. In [1], this
base composition is termed a compomer of the string $s$
with respect to the cut base $x$. The whole set of compomers
is hence called the compomer spectrum of the string
$s$ with respect to the cut base $x$, and denoted by $\mathcal{C}_x(s)$. Finally,
let $\mathcal{C}_\Sigma(s) = \{\mathcal{C}_x(s) : x \in \Sigma\} = \{\mathcal{C}_A(s), \mathcal{C}_C(s), \mathcal{C}_G(s), \mathcal{C}_T(s)\}$,
a collection of four compomer spectra of the string $s$
where each is generated with one cut base.

Figure 1 Schematic outline. The SNP discovery approach using base-specific cleavage and mass spectrometry.

Example 1 Let $s := ACATGCTACATTA$. Then, the
string $s$ contains four cleavage fragments with respect to
the cut base A: C, TGCT, C, and TT. With respect to the
cut base T, it instead contains five cleavage fragments:
ACA, GC, ACA, $\epsilon$, and A. Their respective compomer
spectra are $\mathcal{C}_A(s) = \{A_0C_1G_0T_0, A_0C_1G_1T_2, A_0C_0G_0T_2\}$ and
$\mathcal{C}_T(s) = \{A_2C_1G_0T_0, A_0C_1G_1T_0, A_0C_0G_0T_0, A_1C_0G_0T_0\}$. Note that
each compomer appears in a compomer spectrum at most
once.
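
The definitions above can be checked mechanically. The following is a minimal Python sketch, not taken from the paper, that computes cleavage fragments and compomer spectra. The function names and the encoding of a compomer $A_iC_jG_kT_l$ as the tuple `(i, j, k, l)` are assumptions made for illustration.

```python
ALPHABET = "ACGT"

def cleavage_fragments(s, x):
    """Return the cleavage fragments of s w.r.t. cut base x, left to right."""
    pieces = s.split(x)
    last = len(pieces) - 1
    fragments = []
    for idx, piece in enumerate(pieces):
        # str.split yields an empty piece at either end when s starts or ends
        # with x; those are not fragments. An empty piece in the middle comes
        # from two adjacent cut bases and is the empty cleavage fragment.
        if piece == "" and idx in (0, last):
            continue
        fragments.append(piece)
    return fragments

def compomer(fragment):
    """Encode the base composition A_i C_j G_k T_l as the tuple (i, j, k, l)."""
    return tuple(fragment.count(base) for base in ALPHABET)

def compomer_spectrum(s, x):
    """Set of compomers of s w.r.t. cut base x (each compomer appears once)."""
    return {compomer(f) for f in cleavage_fragments(s, x)}

s = "ACATGCTACATTA"
print(cleavage_fragments(s, "A"))
print(cleavage_fragments(s, "T"))
print(sorted(compomer_spectrum(s, "T")))
```

Applied to the string of Example 1, the sketch reproduces the four fragments for cut base A and the five fragments (including the empty one) for cut base T.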

Problem formulation 

Let $d_H(s, s')$ denote the Hamming distance between two
strings $s$ and $s'$ of equal length. It measures the minimum
number of substitutions required to transform one
string into the other. Given a collection of compomer
spectra $\mathcal{C}_\Sigma = \{\mathcal{C}_x : x \in \Sigma\}$ of an unknown string $s'$ (i.e.,
the sample DNA sequence experimented), which can in
principle be generated from a mass spectrometry experiment,
and a string $s$ (i.e., the reference DNA sequence),
which is believed to differ from the unknown string $s'$
by a number of substitutions only, we formulate below
two combinatorial optimization problems for SNP
discovery.

Definition 2 (The SNP-MS_P problem) Given a
string $s$ and a collection of compomer spectra
$\mathcal{C}_\Sigma = \{\mathcal{C}_x : x \in \Sigma\}$, find a string $s'$ such that $\mathcal{C}_x(s') \subseteq \mathcal{C}_x$
for all $x \in \Sigma$ and $d_H(s, s')$ is minimized.

Definition 3 (The SNP-MS_Q problem) Given a
string $s$ and a collection of compomer spectra
$\mathcal{C}_\Sigma = \{\mathcal{C}_x : x \in \Sigma\}$, find a string $s'$ such that $\mathcal{C}_x \subseteq \mathcal{C}_x(s')$
for all $x \in \Sigma$ and $d_H(s, s')$ is minimized.

The only difference between the above two problem
formulations is that one requires $\mathcal{C}_x(s') \subseteq \mathcal{C}_x$ and the
other requires $\mathcal{C}_x \subseteq \mathcal{C}_x(s')$, for all the cut bases. Once
the string $s'$ is found, it is easy to identify the SNPs in $s'$,
i.e., those base substitutions that transform $s'$ into $s$.

Example 4 In this example, we let $\Sigma := \{A, T\}$ for simplicity.
Given the string $s := ATAAT$ and the set
$\mathcal{C}_\Sigma = \{\mathcal{C}_A, \mathcal{C}_T\}$ of compomer spectra (of an unknown string)
where

$\mathcal{C}_A = \{A_0T_1, A_0T_2\}$ and $\mathcal{C}_T = \{A_0T_0, A_1T_0\}$.

The feasible solutions to the SNP-MS_P problem for
the above instance include the strings such as ATATA,
TATAT, TTATT, ATATT, and ATTAT. Their respective
Hamming distances to the input string $s$ are 2, 3, 2,
1, and 1. The string $s' = TTAAT$ is not a feasible solution
because the compomer $A_2T_0 \in \mathcal{C}_T(s')$ but $A_2T_0 \notin \mathcal{C}_T$,
so that $\mathcal{C}_T(s') \not\subseteq \mathcal{C}_T$.



The feasible solutions to the SNP-MS_Q problem for
the above instance include the strings such as TTATA,
TATTA, ATATT, and ATTAT. Their respective Hamming
distances to the input string $s$ are 3, 5, 1, and 1.
The string $s' = TTAAT$ is not a feasible solution because
the compomer $A_1T_0 \in \mathcal{C}_T$ but $A_1T_0 \notin \mathcal{C}_T(s')$, so that
$\mathcal{C}_T \not\subseteq \mathcal{C}_T(s')$.
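
Because the instance of Example 4 is tiny, both problem formulations can be verified by exhaustive enumeration. The sketch below is an illustrative brute-force check only (it is exponential in $|s|$ and is not the algorithm proposed in this paper). The helper names and the encoding of a compomer over $\{A, T\}$ as the tuple `(number of As, number of Ts)` are assumptions.

```python
from itertools import product

def spectrum(s, x, alphabet):
    """Compomer spectrum of s w.r.t. cut base x, as a set of count tuples."""
    pieces = s.split(x)
    last = len(pieces) - 1
    # Keep non-empty pieces, plus empty pieces strictly inside (adjacent cuts).
    frags = [p for i, p in enumerate(pieces) if p or 0 < i < last]
    return {tuple(f.count(b) for b in alphabet) for f in frags}

def hamming(a, b):
    return sum(c != d for c, d in zip(a, b))

def brute_force(s, spectra, alphabet, mode):
    """Enumerate all feasible strings; mode 'P' or 'Q' selects the problem."""
    feasible = {}
    for cand in map("".join, product(alphabet, repeat=len(s))):
        if mode == "P":   # predicted spectra contained in measured spectra
            ok = all(spectrum(cand, x, alphabet) <= spectra[x] for x in alphabet)
        else:             # measured spectra contained in predicted spectra
            ok = all(spectra[x] <= spectrum(cand, x, alphabet) for x in alphabet)
        if ok:
            feasible[cand] = hamming(s, cand)
    return feasible

spectra = {"A": {(0, 1), (0, 2)}, "T": {(0, 0), (1, 0)}}
p = brute_force("ATAAT", spectra, "AT", "P")
q = brute_force("ATAAT", spectra, "AT", "Q")
print(min(p.values()), min(q.values()))
```

On this instance the enumeration confirms the distances listed in Example 4 and that TTAAT is infeasible for both problems.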

The measured mass spectra of a sample sequence are
rarely perfect in practice. Some peaks may actually
represent noise, while some true signal peaks may be missing.
The problem SNP-MS_P is so formulated that its
computational solution would be robust against noisy
peaks but susceptible to missing peaks (i.e., there is a
good chance to recover the sample sequence even if
some noisy peaks are present in the measured mass
spectra, but the chance would become much smaller if
some true signal peaks are missing). In contrast,
the problem SNP-MS_Q is so formulated that its computational
solution would be robust against missing
peaks but susceptible to noisy peaks.

We note that several computational problems in
the literature are more or less related to the problems
introduced above. In [9], a so-called sequencing
from compomers problem was studied which, like the
SNP-MS_P problem, also aimed to reconstruct the sample
sequence from a given collection of compomer spectra,
but without the help of a reference sequence. In [10],
the spectral alignment problem differs from the
SNP-MS_P problem mainly in that it explores short
read sequencing data rather than mass/compomer
spectra data, which may have wide implications for the
subsequent algorithm design and complexity analysis.
Moreover, in [1], a so-called SNP discovery from mass
spectrometry problem was defined in a similar way to
the SNP-MS_Q problem. However, it has only a single
compomer as input, as opposed to the collection of
four complementary compomer spectra used in the
SNP-MS_Q problem.

Results 

An exact dynamic programming algorithm for SNP-MS_P

In this subsection, we describe an exact dynamic
programming algorithm for solving the SNP-MS_P problem.
Without loss of generality, we may assume in the
remainder of this section that every base of $\Sigma$ will eventually
occur in the optimal solution to a given instance
of the SNP-MS_P problem. Consequently, only those
feasible solutions that contain all the bases of $\Sigma$ need to
be considered when we search for the optimal solution.
In case some base $x$ would not occur in the optimal
solution $s'$, note that it becomes relatively easy to find $s'$
since we would have $s' \in \mathcal{L}_x \cap \mathcal{R}_x$ and $|s'| = |s|$. See
below for definitions of $\mathcal{L}_x$ and $\mathcal{R}_x$.

Let us start with some preliminary definitions and notations.
For a string $s$, a cleavage fragment $s[i, j]$ is called
internal if neither $i = 1$ nor $j = |s|$, left-ended if $i = 1$, or
right-ended if $j = |s|$. In addition, a cleavage fragment $\epsilon$ is
always considered internal. Given a collection of compomer
spectra $\mathcal{C}_\Sigma$, we call a string I-compatible if the
compomers of its internal cleavage fragments are all contained
in $\mathcal{C}_\Sigma$ (under the respective cut base). A string is
called L-compatible (resp. R-compatible) if it is I-compatible
and if the compomers of its left-ended (resp. right-ended)
cleavage fragments are all contained in $\mathcal{C}_\Sigma$ as well.

Example 5 Consider the string s given in Example 1. 
The four cleavage fragments of s with respect to the cut 
base A are all internal. Among the five cleavage frag- 
ments of s with respect to the base T, the first cleavage 
fragment ACA is left-ended, the last cleavage fragment 
A is right-ended, and the other three cleavage fragments 
in the middle are all internal. 

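Example 6 Let $\mathcal{C}_\Sigma = \{\mathcal{C}_A, \mathcal{C}_C, \mathcal{C}_G, \mathcal{C}_T\}$ be a collection of
compomer spectra where

$\mathcal{C}_A = \{A_0C_1G_0T_0, A_0C_1G_1T_2, A_0C_0G_0T_2\}$,

$\mathcal{C}_C = \{A_1C_0G_0T_0, A_1C_0G_1T_1, A_1C_0G_0T_1, A_2C_0G_0T_2\}$,

$\mathcal{C}_G = \{A_2C_1G_0T_1, A_3C_2G_0T_3\}$, and

$\mathcal{C}_T = \{A_2C_1G_0T_0, A_0C_1G_1T_0, A_0C_0G_0T_0, A_1C_0G_0T_0\}$.

We show in Table 1 whether each of the given strings
is I-compatible, L-compatible, or R-compatible with $\mathcal{C}_\Sigma$.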

For each compomer $A_iC_jG_kT_l \in \mathcal{C}_x$ in a given collection
of compomer spectra $\mathcal{C}_\Sigma$, we use $\mathcal{I}_x(A_iC_jG_kT_l)$ to denote
the set of strings that (i) consist of $i$ As, $j$ Cs, $k$ Gs, $l$ Ts, (ii)
contain exactly three distinct bases (i.e., three bases in the
set $\Sigma \setminus \{x\}$), and (iii) are I-compatible with $\mathcal{C}_\Sigma$. It is easy
to check that $|\mathcal{I}_x(A_iC_jG_kT_l)| \le \frac{(i+j+k+l)!}{i!\,j!\,k!\,l!}$. In particular, if
there exists in $A_iC_jG_kT_l$ a non-cut base whose composition
value is zero, then we have $\mathcal{I}_x(A_iC_jG_kT_l) = \emptyset$ so that
$|\mathcal{I}_x(A_iC_jG_kT_l)| = 0$. Furthermore, we may define the following
set

$$\mathcal{I}_x = \bigcup_{A_iC_jG_kT_l \in \mathcal{C}_x} \mathcal{I}_x(A_iC_jG_kT_l), \quad \forall x \in \Sigma.$$



Table 1 Examples.
This table shows whether each of the given strings is I-compatible,
L-compatible, or R-compatible with $\mathcal{C}_\Sigma$.



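Then, let $\mathcal{I}_\Sigma = \{\mathcal{I}_A, \mathcal{I}_C, \mathcal{I}_G, \mathcal{I}_T\}$. Analogously, we may
define $\mathcal{L}_x(A_iC_jG_kT_l)$, $\mathcal{R}_x(A_iC_jG_kT_l)$, $\mathcal{L}_\Sigma = \{\mathcal{L}_A, \mathcal{L}_C, \mathcal{L}_G, \mathcal{L}_T\}$,
and $\mathcal{R}_\Sigma = \{\mathcal{R}_A, \mathcal{R}_C, \mathcal{R}_G, \mathcal{R}_T\}$ for the L-compatible strings
and the R-compatible strings, respectively. Clearly, $\mathcal{L}_x \subseteq \mathcal{I}_x$
and $\mathcal{R}_x \subseteq \mathcal{I}_x$, for all $x \in \Sigma$.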

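Example 7 Consider the collection of compomer spectra
$\mathcal{C}_\Sigma$ given in Example 6. For the compomer $A_0C_1G_1T_2 \in \mathcal{C}_A$,
we have $\mathcal{I}_A(A_0C_1G_1T_2) = \{CGTT, CTTG, GCTT, GTTC, TCGT, TGCT, TTCG, TTGC\}$,
and $\mathcal{L}_A(A_0C_1G_1T_2) = \mathcal{R}_A(A_0C_1G_1T_2) = \emptyset$. For the compomer
$A_0C_1G_1T_0 \in \mathcal{C}_T$, we have $\mathcal{I}_T(A_0C_1G_1T_0) = \mathcal{L}_T(A_0C_1G_1T_0) = \mathcal{R}_T(A_0C_1G_1T_0) = \emptyset$.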
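
The compatibility notions and the sets $\mathcal{I}_x$, $\mathcal{L}_x$, $\mathcal{R}_x$ can be explored with a short script. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: compomers are encoded as `(A, C, G, T)` count tuples, the function names are hypothetical, and candidate strings are enumerated naively as permutations of the compomer's bases.

```python
from itertools import permutations

ALPHABET = "ACGT"

def comp(frag):
    return tuple(frag.count(b) for b in ALPHABET)

def classified_fragments(s, x):
    """Yield (fragment, kind); kind is 'I', 'L', 'R', or 'LR' (whole string)."""
    pieces = s.split(x)
    last = len(pieces) - 1
    for i, p in enumerate(pieces):
        if p == "":
            if 0 < i < last:
                yield p, "I"          # empty fragment: always internal
            continue                  # empty end pieces are not fragments
        kind = ("L" if i == 0 else "") + ("R" if i == last else "")
        yield p, kind or "I"

def compatible(s, spectra, ends=""):
    """I-compatibility; pass ends='L' or 'R' to also check end fragments."""
    for x in ALPHABET:
        for frag, kind in classified_fragments(s, x):
            must_check = kind == "I" or any(e in kind for e in ends)
            if must_check and comp(frag) not in spectra[x]:
                return False
    return True

def strings_for(x, compomer, spectra, ends=""):
    """I_x (ends=''), L_x (ends='L') or R_x (ends='R') of one compomer."""
    bases = "".join(b * n for b, n in zip(ALPHABET, compomer))
    if set(bases) != set(ALPHABET) - {x}:   # needs all three non-cut bases
        return set()
    candidates = {"".join(p) for p in permutations(bases)}
    return {c for c in candidates if compatible(c, spectra, ends)}

SPECTRA = {  # the collection of Example 6
    "A": {(0, 1, 0, 0), (0, 1, 1, 2), (0, 0, 0, 2)},
    "C": {(1, 0, 0, 0), (1, 0, 1, 1), (1, 0, 0, 1), (2, 0, 0, 2)},
    "G": {(2, 1, 0, 1), (3, 2, 0, 3)},
    "T": {(2, 1, 0, 0), (0, 1, 1, 0), (0, 0, 0, 0), (1, 0, 0, 0)},
}
print(sorted(strings_for("A", (0, 1, 1, 2), SPECTRA)))
```

With the spectra of Example 6, the sketch returns the eight strings of $\mathcal{I}_A(A_0C_1G_1T_2)$ listed in Example 7 and empty sets for the L- and R-compatible variants.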

Given a string $t$ which could be a potential cleavage
fragment with respect to the cut base $x$ (i.e., the string $t$
does not contain any base $x$), we say a string $s$ begins
with the string $t$ if $t \cdot x$ is a prefix of $s \cdot x$, and we say a string
$s$ ends with the string $t$ if $x \cdot t$ is a suffix of $x \cdot s$. The
following lemma is useful for designing a dynamic programming
algorithm for solving the SNP-MS_P problem. Its
easy proof is omitted. Recall that our discussions in this
section are limited to the feasible solutions containing
all the bases of $\Sigma$.

Lemma 8 A string $s'$ of length $|s|$ is a feasible solution
to the SNP-MS_P problem if and only if

- all the substrings of $s'$ are I-compatible with $\mathcal{C}_\Sigma$,

- $s'$ begins with a string in $\mathcal{L}_x$ for some $x \in \Sigma$, and

- $s'$ ends with a string in $\mathcal{R}_x$ for some $x \in \Sigma$.

Suppose we have an input instance $(s, \mathcal{C}_\Sigma)$ of the
SNP-MS_P problem. Given a string $t \in \mathcal{I}_x$ where $x \in \Sigma$,
we define $\mathcal{H}(i, t)$ to be the minimum Hamming distance
between the prefix of $s$ of length $i$ and a string
which is such that

- all its substrings are I-compatible with $\mathcal{C}_\Sigma$,

- it begins with a string from $\mathcal{L}_y$ for some $y \in \Sigma$, and

- it ends with the given string $t$.

To compute $\mathcal{H}(i, t)$, we first find in the string $x \cdot t$ the
rightmost position $k$ at which the base $(x \cdot t)[k]$ is its
first occurrence. Formally, we may write

$$k = \max\{j : \forall i,\ 1 \le i < j \le |x \cdot t|,\ (x \cdot t)[i] \ne (x \cdot t)[j]\}.$$

Then, let $x' := (x \cdot t)[k]$, $p := (x \cdot t)[1, k - 1]$, and $q := (x \cdot t)[k, |x \cdot t|]$.
Note that $x' \ne x$ and the string $p$ contains all the
bases of $\Sigma$ except $x'$.

Example 9 Let $t := CGTT \in \mathcal{I}_A$. Then, $x \cdot t = ACGTT$,
$k = 4$, $x' = T$, $p = ACG$, and $q = TT$.
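
The decomposition of $x \cdot t$ into $p$ and $q$ is a single left-to-right scan. A minimal Python sketch follows. The function name `split_fragment` is hypothetical, and positions are reported 1-based to match the text.

```python
def split_fragment(x, t):
    """Return (k, x', p, q) for the string x.t, with k a 1-based position."""
    xt = x + t
    seen = set()
    k = 0
    for pos, base in enumerate(xt, start=1):
        if base not in seen:      # first occurrence of this base
            seen.add(base)
            k = pos               # remember the rightmost first occurrence
    # p is everything before position k, q starts at position k
    return k, xt[k - 1], xt[:k - 1], xt[k - 1:]

print(split_fragment("A", "CGTT"))
```

For $t = CGTT$ and $x = A$ this yields the values of Example 9.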

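To compute $\mathcal{H}(i, t)$, we now use the following recurrence
relation

$$\mathcal{H}(i, t) = \min_{\substack{t' \in \mathcal{I}_{x'} \\ \exists t'',\ t' = t'' \cdot p}} \left\{ \mathcal{H}(i - |q|, t') + d_H(s[i - |q| + 1, i], q) \right\}.$$

Note that the minimization in the above is taken over
all those strings $t'$ in $\mathcal{I}_{x'}$ which have $p$ as the suffix. If
there is no such string in $\mathcal{I}_{x'}$, then we let $\mathcal{H}(i, t) = \infty$.
The initial conditions for the recurrence relation are
given as follows:

$$\mathcal{H}(i, t) = \begin{cases} \infty & \text{if } i < |t| \text{ and } t \in \mathcal{I}_x, \\ d_H(s[1, i], t) & \text{if } i = |t| \text{ and } t \in \mathcal{L}_x, \\ \infty & \text{if } i = |t| \text{ and } t \in \mathcal{I}_x \setminus \mathcal{L}_x. \end{cases}$$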



Theorem 10 Let $s'$ be the string that leads to

$$d_H(s, s') = \min_{t \in \mathcal{R}_x,\ x \in \Sigma} \mathcal{H}(|s|, t),$$

then $s'$ would be an optimal solution to the input
instance $(s, \mathcal{C}_\Sigma)$ of the SNP-MS_P problem.

Proof: For the correctness of the above dynamic programming
algorithm, we need to show that (i) every feasible
solution of the SNP-MS_P problem would be
essentially evaluated by the dynamic programming algorithm,
and (ii) every string evaluated by the dynamic
programming algorithm must be a feasible solution of
the SNP-MS_P problem.

Let the string $s'$ be a feasible solution. Consider a cleavage
fragment $t$ of $s'$ that contains all the bases of $\Sigma$ except
its corresponding cut base $x$. Clearly, $t \in \mathcal{I}_x$ and $t$ is the
suffix of a substring $s'[1, i]$ for some integer $i$. Without loss
of generality, we can further suppose that $t \ne s'[1, i]$. To
show (i), what we mainly need to show is that there exists
a string $t' \in \mathcal{I}_{x'}$ such that $p$ is the suffix of $t'$ and $t'$ is the
suffix of the substring $s'[1, i - |q|]$, where $x'$, $p$, and $q$ are
computed for the string $t$ as described earlier. Indeed, we
can find the string $t'$ as follows. First, let $(i' - 1)$ be the
position of the last occurrence of the base $x'$ in the substring
$s'[1, i - |t|]$; if there is no such occurrence, we let
$i' = 1$. Then, we assign $t' := s'[i', i - |q|]$. Obviously, $t'$ is the
suffix of $s'[1, i - |q|]$. Because $s'[i - |t|] = x$ and $x \ne x'$, we
have $i' \le i - |t|$. It then follows from $p = s'[i - |t|, i - |q|]$
that $p$ shall be the suffix of $t'$. Since $p$ contains all the
bases of $\Sigma$ except $x'$, so does $t'$. Moreover, $t'$ is a cleavage
fragment of $s'$ with respect to the cut base $x'$ because we
have either $s'[i' - 1] = x'$ or $i' = 1$ on the left end of $t'$ and
$s'[i - |q| + 1] = x'$ on the right end of $t'$. By Lemma 8, we
can see that $t' \in \mathcal{I}_{x'}$. For the reader's convenience, we
demonstrate in the following example how to find $t'$ from
$t$. Let $s' = ACATGCTACATTA$, $t = s'[4, 7] = TGCT$, $i = 7$,
$x = A$, and $\mathcal{C}_\Sigma$ be the one as given in Example 6. Note
that $t \in \mathcal{I}_A$. Further, for the given string $t = TGCT$, we
have $x' = C$, $p = ATG$, and $q = CT$. Then, we obtain that
$i' = 3$ and then $t' = s'[3, 7 - 2] = s'[3, 5] = ATG$. It is easy
to check that $p$ is the suffix of $t'$, $t'$ is the suffix of the substring
$s'[1, i - |q|]$, and $t' \in \mathcal{I}_{x'}$.
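
The step from $t$ to $t'$ described above can be written out directly. The sketch below is illustrative only. The function name `predecessor_fragment` is hypothetical, indices in the signature are 1-based as in the text, and the code assumes $t \ne s'[1, i]$ so that the cut base $s'[i - |t|]$ exists.

```python
def predecessor_fragment(s_prime, i, t):
    """Given the fragment t = s'[i-|t|+1, i], return (i', t') as in the proof."""
    assert i > len(t) and s_prime[i - len(t):i] == t
    x = s_prime[i - len(t) - 1]                 # cut base s'[i - |t|]
    xt = x + t
    # Rightmost position of xt holding the first occurrence of its base.
    k = max(pos for pos, b in enumerate(xt) if b not in xt[:pos])
    x_next, q = xt[k], xt[k:]                   # x' and q (p is xt[:k])
    # Last occurrence of x' in s'[1, i-|t|]; rfind returns -1 if absent,
    # which gives i' = 1 after the shift below.
    last = s_prime[:i - len(t)].rfind(x_next)   # 0-based position or -1
    i_next = last + 2                           # 1-based start of t'
    t_next = s_prime[i_next - 1:i - len(q)]     # t' = s'[i', i - |q|]
    return i_next, t_next

print(predecessor_fragment("ACATGCTACATTA", 7, "TGCT"))
```

On the worked example of the proof, the function returns $i' = 3$ and $t' = ATG$.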

On the other hand, let $s'$ be a string evaluated by the
dynamic programming algorithm. So, the string $s'$ must
begin with a string in $\mathcal{L}_x$ for some $x \in \Sigma$ and end with
a string in $\mathcal{R}_y$ for some $y \in \Sigma$. Consider a cleavage fragment
$t$ of $s'$ that was used to construct the string $s'$ during
the backtracking procedure of the algorithm.
Clearly, the string $t$ contains all the bases of $\Sigma$ except
its corresponding cut base $x$. Moreover, $t \in \mathcal{I}_x$ and $t$ is
the suffix of a substring $s'[1, i]$ for some integer $i$. Without
loss of generality, we can further suppose $t \ne s'[1, i]$
and $i \ne |s'|$, so that $s'[i - |t|] = s'[i + 1] = x$. Let $t'$ be the
string considered next to the string $t$ during the backtracking
procedure of the algorithm. Thus, we have
$t' \in \mathcal{I}_{x'}$ such that $p$ is the suffix of $t'$ and $t'$ is the suffix
of the substring $s'[1, i - |q|]$, where $x'$, $p$, and $q$ are computed
for the string $t$ as described earlier. More specifically,
there exists $i'$ such that $t' = s'[i', i - |q|]$ and
$s'[i' - 1] = s'[i - |q| + 1] = x'$ if $i' \ne 1$. To show (ii), by
Lemma 8 and also by backward induction, what we
mainly need to show is that the extended substring
$s'[i', |s'|]$ is I-compatible with $\mathcal{C}_\Sigma$, given that the substring
$s'[i - |t| + 1, |s'|]$ is already I-compatible with $\mathcal{C}_\Sigma$. To
this end, we consider any internal cleavage fragment $s'[j, k]$
of $s'[i', |s'|]$ with respect to the cut base $x'' = s'[j - 1] =
s'[k + 1]$. By definition of the internal cleavage fragment,
we have $j \ge i' + 1$ and $k \le |s'| - 1$. In the following we distinguish
four cases:

- If $j \ge i - |t| + 2$, then $s'[j, k]$ is an internal cleavage
fragment of $s'[i - |t| + 1, |s'|]$. Since $s'[i - |t| + 1, |s'|]$
is already assumed to be I-compatible with $\mathcal{C}_\Sigma$, the
base composition of $s'[j, k]$ shall also be contained
in $\mathcal{C}_{x''}$.

- If $j = i - |t| + 1$, then $x'' = x$, which further implies
that $k = i$ and $s'[j, k] = t$. Since $t \in \mathcal{I}_x$, the base composition
of $s'[j, k]$ shall be contained in $\mathcal{C}_{x''}$.

- If $j \le i - |t|$ and $k \ge i - |q|$, then $s'[i - |t|, i - |q|]$
is a substring of $s'[j, k]$. Since $s'[i - |t|, i - |q|] = p$ contains
all the bases of $\Sigma$ except $x'$, the cut base $x''$ could only
be $x'$. However, position $j - 1$ lies within $t' = s'[i', i - |q|]$,
which does not contain $x'$, so the string $s'[j, k]$ cannot be
a cleavage fragment (as a cleavage fragment must
not contain its corresponding cut base). Therefore,
the case where $j \le i - |t|$ and $k \ge i - |q|$ cannot occur.

- If $k \le i - |q| - 1$, then $s'[j, k]$ is an internal cleavage
fragment of $t' = s'[i', i - |q|]$. Since $t' \in \mathcal{I}_{x'}$, the base
composition of $s'[j, k]$ shall be contained in $\mathcal{C}_{x''}$.

In conclusion, for every internal cleavage fragment of
$s'[i', |s'|]$, its base composition is contained in $\mathcal{C}_\Sigma$ under
the respective cut base. Therefore, the extended substring
$s'[i', |s'|]$ is still I-compatible with $\mathcal{C}_\Sigma$.

Note that computing each entry $\mathcal{H}(i, t)$ of the dynamic
programming table may take time $O(|s| \cdot |\mathcal{I}_\Sigma|)$, where
$|\mathcal{I}_\Sigma| = |\mathcal{I}_A| + |\mathcal{I}_C| + |\mathcal{I}_G| + |\mathcal{I}_T|$. Hence, the above
dynamic programming algorithm can be done in time
$O(|s|^2 \cdot |\mathcal{I}_\Sigma|^2)$. In the worst case, we may have
$|\mathcal{I}_\Sigma| = O(|s|!)$, that is, $|\mathcal{I}_\Sigma|$ is in the factorial order of
the input problem size. In practice, however, we would
expect $|\mathcal{I}_\Sigma|$ to be small enough to be manageable, because
cleavage fragments are usually of small size. Therefore,
the above dynamic programming algorithm could be a
practically feasible solution to the problem SNP-MS_P,
especially when compared to the brute-force algorithm
which needs to examine all the possible strings $s'$. For
the special case where $|\Sigma| = 2$, SNP-MS_P is actually
an easy problem, as we can see from the above that
$|\mathcal{I}_\Sigma| = O(|s|)$.

Corollary 11 The above dynamic programming algorithm
can solve the SNP-MS_P problem in polynomial
time when $|\Sigma| = 2$.

The NP-hardness of SNP-MS_Q

This subsection is dedicated to proving that the
SNP-MS_Q problem is NP-hard. We begin with a brief
introduction of the 3-partition problem.

Definition 12 (The general form of the 3-partition
problem) Given a multiset of positive integers
$A = \{a_1, a_2, \ldots, a_n\}$ where $n = 3m$ and $\sum_{i=1}^{n} a_i = mB$,
can we partition the multiset $A$ into $m$ multisets
$A_1, A_2, \ldots, A_m$, such that the sum of each multiset is
equal to $B$?

The 3-partition problem is strongly NP-complete [7].
Therefore, it remains NP-complete even when the integers
in $A$ and the integer $B$ are encoded in unary. In
this case, the size of a problem instance is $\Theta(nB)$. In
contrast, it becomes $O(n \log B)$ when using the binary
encoding of integers.

Definition 13 (The restricted variation of the 3-partition
problem) Given a set of positive integers
$A = \{a_1, a_2, \ldots, a_n\}$ where $n = 3m$, $\sum_{i=1}^{n} a_i = mB$, and
$\frac{B}{4} < a_i < \frac{B}{2}$, $\forall 1 \le i \le n$, can we partition the set $A$ into
$m$ subsets $A_1, A_2, \ldots, A_m$, such that the sum of each
subset is equal to $B$?

There are two constraints imposed in the above
restricted variation of the 3-partition problem. The first
one limits $A$ to be a set so that all the integers in $A$ are
distinct. The second one limits all the integers in $A$ to lie
strictly between $\frac{B}{4}$ and $\frac{B}{2}$, which subsequently forces
every subset $A_i$ to consist of exactly three elements.
Interestingly, this restricted variation of the 3-partition
problem remains strongly NP-complete [8], just like the
general form of the 3-partition problem. Note that the
second constraint $\frac{B}{4} < a_i < \frac{B}{2}$ was actually not imposed
in [8]. However, it can be easily enforced by adding $B$ to each
$a_i$ and then multiplying $B$ by 4.
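
The transformation mentioned in the last sentence can be spelled out in a few lines. This Python sketch is only an illustration. The function name `enforce_bounds` is hypothetical, and it assumes every input integer satisfies $0 < a_i < B$, so that the shifted values fall strictly inside the required interval.

```python
def enforce_bounds(A, B):
    """Map (A, B) to (A', B') with B'/4 < a' < B'/2 for every a' in A'."""
    assert all(0 < a < B for a in A), "sketch assumes 0 < a_i < B"
    new_A = [a + B for a in A]   # add B to each integer
    new_B = 4 * B                # a triple summing to B now sums to B + 3B
    return new_A, new_B

new_A, new_B = enforce_bounds([1, 5, 6, 2, 3, 7], 12)
print(new_A, new_B)
```

A triple that sums to $B$ before the transformation sums to $4B$ afterwards, so valid partitions into triples are preserved.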



Theorem 14 The SNP-MS_Q problem is NP-hard,
even when $|\Sigma| = 2$.

Proof: We prove it by a reduction from the above
restricted variation of the 3-partition problem. As an
input for 3-partition, we are given a set of distinct positive
integers $A = \{a_1, a_2, \ldots, a_n\}$ where $n = 3m$,
$\sum_{i=1}^{n} a_i = mB$, and $\frac{B}{4} < a_i < \frac{B}{2}$, $\forall 1 \le i \le n$. Then, we
construct an instance $(s, \mathcal{C}_\Sigma)$ of the SNP-MS_Q problem
as follows:

- Let $\Sigma = \{G, T\}$.

- Let $s$ be the string such that $s \cdot T = (G^{B+2}T)^m$. That
is, let $s \cdot T$ be the concatenation of $m$ copies of the
fragment $GG \cdots GT$, where each fragment consists
of $(B + 2)$ consecutive base Gs followed by one base
T. Note that $|s| = m(B + 3) - 1 = mB + 3m - 1$.

- Let $\mathcal{C}_G = \{G_0T_0, G_0T_1\}$ and $\mathcal{C}_T = \{G_{a_i}T_0 : 1 \le i \le n\}$
so that $\mathcal{C}_\Sigma = \{\mathcal{C}_G, \mathcal{C}_T\}$.
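
The three construction steps above can be sketched as a short Python function. The name `build_instance` and the encoding of a compomer $G_iT_j$ as the tuple `(i, j)` are assumptions for illustration. The assertions simply restate the preconditions of the restricted 3-partition instance.

```python
def build_instance(A, B):
    """Build the reduced instance (s, spectra) from a 3-partition input."""
    n = len(A)
    assert n % 3 == 0 and len(set(A)) == n      # n = 3m, distinct integers
    m = n // 3
    assert sum(A) == m * B
    assert all(B < 4 * a and 2 * a < B for a in A)   # B/4 < a_i < B/2
    # s.T is m copies of G^(B+2) T; dropping the final T gives s.
    s = (("G" * (B + 2) + "T") * m)[:-1]
    spectra = {
        "G": {(0, 0), (0, 1)},          # C_G = {G0T0, G0T1}
        "T": {(a, 0) for a in A},       # C_T = {G_{a_i} T0}
    }
    return s, spectra

s, spectra = build_instance([10, 11, 18, 12, 13, 14], 39)
print(len(s), s.count("T"))
```

For the small instance used here ($m = 2$, $B = 39$), the reference string has length $mB + 3m - 1 = 83$ and contains $m - 1 = 1$ base T.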

First, we check whether this construction can be done
in polynomial time in the size of the input instance of
the 3-partition problem. Since the restricted variation of
the 3-partition problem is strongly NP-complete, we
may encode the integers in unary so that the size of the
input instance is $\Theta(nB)$. In the above reduction, we can
easily see that the first step can be done in constant
time, the second step in time $O(mB)$, and the third step
in time $O(n \log B)$. Therefore, the total time needed for
construction is $O(nB)$, no more than time polynomial in
the size of the input instance of the 3-partition problem.

Next, we show that every feasible solution $s''$ to the
reduced instance $(s, \mathcal{C}_\Sigma)$ of the SNP-MS_Q problem is
such that (i) $\mathcal{C}_T(s'') = \mathcal{C}_T$, (ii) $s''$ contains exactly $3m - 1$
base Ts, and (iii) $d_H(s, s'') \ge 2m$. For each compomer
$G_{a_i}T_0 \in \mathcal{C}_T \subseteq \mathcal{C}_T(s'')$, there exists at least one cleavage
fragment $G^{a_i}$ in $s''$ that is obtained with respect to the
cut base T. Since all the integers $a_i$ are distinct, all such
cleavage fragments shall be pairwise non-overlapping.
Hence, the string $s''$ contains at least $\sum_{i=1}^{n} a_i = mB$ base
Gs and at least $n - 1 = 3m - 1$ base Ts. On the other
hand, since $|s| = mB + 3m - 1$, the string $s''$ consists
of exactly $mB + 3m - 1$ bases. Therefore, we can
deduce that $s''$ contains exactly $3m - 1$ base Ts and
further that $\mathcal{C}_T(s'')$ cannot have any other compomer
than those in $\mathcal{C}_T$. By construction, we also know that
the string $s$ contains exactly $m - 1$ base Ts, which
hence implies that $d_H(s, s'') \ge 2m$.

Now, we are going to show that there exists a valid
partition for the input instance of the 3-partition problem
if and only if there exists an optimal solution $s'$ for
the reduced instance of the SNP-MS_Q problem such
that $d_H(s, s') = 2m$.

Suppose that $A$ can be partitioned into $m$ subsets
$A_1, A_2, \ldots, A_m$ such that, for each subset
$A_i = \{a_{i_1}, a_{i_2}, a_{i_3}\}$, its size is three and its integer elements
add up to exactly $B$, that is, $|A_i| = 3$ and
$\sum_{j=1}^{3} a_{i_j} = B$, $\forall 1 \le i \le m$. Then, we use the following
procedure to find the string $s'$:

    s' := ε;
    for i = 1 to m
        for j = 1 to 3
            s' := s' · G^{a_{i_j}} T;   // append the string G^{a_{i_j}} T to s'
        end
    end
    s' := s'[1, |s'| - 1];              // remove the last base T

As one can easily check, the resulting string $s'$ is such
that $|s'| = mB + 3m - 1$, $\mathcal{C}_G \subseteq \mathcal{C}_G(s')$, and $\mathcal{C}_T \subseteq \mathcal{C}_T(s')$.
Therefore, $s'$ is a feasible solution to the reduced
instance $(s, \mathcal{C}_\Sigma)$ of the SNP-MS_Q problem. On the
other hand, since $\sum_{j=1}^{3} a_{i_j} = B$, $\forall 1 \le i \le m$, we can
deduce that $s'[k] = s[k]$ if $s'[k] = G$ or $s[k] = T$; otherwise,
$s'[k] \ne s[k]$, $\forall k \in [1, mB + 3m - 1]$. Therefore,
$d_H(s, s') = |\{k : s'[k] \ne s[k]\}| = |s| - |\{k : s'[k] = s[k]\}| =
mB + 3m - 1 - |\{k : s'[k] = G\}| - |\{k : s[k] = T\}| = mB + 3m -
1 - mB - (m - 1) = 2m$. It hence follows that $s'$ is indeed an
optimal solution to the reduced instance $(s, \mathcal{C}_\Sigma)$ of the
SNP-MS_Q problem.
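
The forward direction can be checked on a small instance. The following Python sketch builds $s'$ from a given partition and verifies feasibility and the distance $2m$. All function names are hypothetical, compomers $G_iT_j$ are encoded as `(i, j)`, and the partition `[(10, 11, 18), (12, 13, 14)]` with $B = 39$ is an example chosen for illustration, not one taken from the paper.

```python
def spectrum(s, x, alphabet="GT"):
    """Compomer spectrum of s w.r.t. cut base x, as a set of count tuples."""
    pieces = s.split(x)
    last = len(pieces) - 1
    frags = [p for i, p in enumerate(pieces) if p or 0 < i < last]
    return {tuple(f.count(b) for b in alphabet) for f in frags}

def reference(m, B):
    """The reference string s, i.e. (G^(B+2) T)^m without its last T."""
    return (("G" * (B + 2) + "T") * m)[:-1]

def solution_from_partition(parts):
    """Concatenate G^a T for every integer a, then drop the last T."""
    return "".join("G" * a + "T" for part in parts for a in part)[:-1]

def hamming(a, b):
    assert len(a) == len(b)
    return sum(c != d for c, d in zip(a, b))

parts, B = [(10, 11, 18), (12, 13, 14)], 39
m = len(parts)
s = reference(m, B)
s_prime = solution_from_partition(parts)
measured_T = {(a, 0) for part in parts for a in part}
measured_G = {(0, 0), (0, 1)}
print(hamming(s, s_prime),
      measured_T <= spectrum(s_prime, "T"),
      measured_G <= spectrum(s_prime, "G"))
```

Each block of $s'$ has the same length $B + 3$ as a block of $s$ and differs from it only at the two extra Ts, which is where the distance $2m$ comes from.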

Conversely, suppose that the string $s'$ is an optimal
solution to the reduced instance $(s, \mathcal{C}_\Sigma)$ of the
SNP-MS_Q problem such that $d_H(s, s') = 2m$. Then, we
use the following procedure to find a partition
$A_1, A_2, \ldots, A_m$ of $A$:

    s := s · T; s' := s' · T;
    i := 1; j := 1;
    A_i := ∅; a_{i_j} := 0;
    for k = 1 to mB + 3m
        if s'[k] = T
            A_i := A_i ∪ {a_{i_j}};
            j++;
            if s[k] = T
                i++; j := 1;
                A_i := ∅;
            end
            a_{i_j} := 0;
        else
            a_{i_j}++;
        end
    end

It follows from the earlier discussions that
$\mathcal{C}_T(s') = \mathcal{C}_T = \{G_{a_i}T_0 : 1 \le i \le n\}$ and also that $s'$
contains exactly $3m - 1$ base Ts. Furthermore, since
$d_H(s, s') = 2m$, we can deduce that $s'[k] = s[k]$ if $s[k] =
T$, $\forall k \in [1, mB + 3m - 1]$. Notice that $s[k] = T$ if and
only if $k$ can be written as a multiple of $(B + 3)$, that
is, $k = i(B + 3) \in [1, mB + 3m - 1]$, $\forall i$. Therefore,
$s'[k] = T$ if $k = i(B + 3) \in [1, mB + 3m - 1]$, $\forall i$, which
subsequently implies that $\mathcal{C}_T(s'[(i - 1)(B + 3) + 1, i(B + 3) - 1]) \subseteq \mathcal{C}_T(s')$,
for each $i \in [1, m]$. Note that $s[(i - 1)(B + 3) + 1, i(B + 3) - 1]$
is a substring of $s$ that consists of $(B + 2)$ base Gs; it is
located either strictly between two consecutive base Ts or
strictly between one base T and one end of the string $s$.
Since $\mathcal{C}_T(s'[(i - 1)(B + 3) + 1, i(B + 3) - 1]) \subseteq \mathcal{C}_T(s')$, we
can let $\mathcal{C}_T(s'[(i - 1)(B + 3) + 1, i(B + 3) - 1]) = \{G_{a_{i_1}}T_0, G_{a_{i_2}}T_0, \ldots, G_{a_{i_l}}T_0\}$ such that
$a_{i_1} + a_{i_2} + \cdots + a_{i_l} + l - 1 = B + 2$. Since $\frac{B}{4} < a_{i_j} < \frac{B}{2}$, we
can deduce that $l = 3$; hence $a_{i_1} + a_{i_2} + a_{i_3} = B$.
Let $A_i = \{a_{i_1}, a_{i_2}, a_{i_3}\}$, for all $i \in [1, m]$. Then, we can see
that $A_1, A_2, \ldots, A_m$ is a partition of $A$ such that the sum
of integers in each subset is equal to $B$.
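
The extraction procedure of the converse direction translates almost line by line into Python. In this sketch the function name `partition_from_solution` is hypothetical, and the input strings are assumed to be the reference string $s$ and an optimal solution $s'$ with $d_H(s, s') = 2m$. The example instance ($m = 2$, $B = 39$) is chosen for illustration.

```python
def partition_from_solution(s, s_prime):
    """Recover the subsets A_1..A_m by scanning s.T and s'.T in parallel."""
    s, s_prime = s + "T", s_prime + "T"
    parts, current, run = [], [], 0
    for ref_base, base in zip(s, s_prime):
        if base == "T":
            current.append(run)      # a run of Gs in s' ended: record length
            run = 0
            if ref_base == "T":      # block boundary of the reference string
                parts.append(current)
                current = []
        else:
            run += 1                 # still inside a run of Gs
    return parts

B = 39
s = (("G" * (B + 2) + "T") * 2)[:-1]
s_prime = "".join("G" * a + "T" for a in (10, 11, 18, 12, 13, 14))[:-1]
print(partition_from_solution(s, s_prime))
```

Every recovered subset contains three integers that sum to $B$, as the argument above requires.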

Extensions to edit distance 

Naturally we may extend our previous problem formula- 
tions to the edit distance (i.e., Levenshtein distance). 
The resulting two new problems are formally defined as 
follows. 

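Definition 15 (The edit-distance version of the SNP-MS_P problem) Given a
string $s$ and a collection of compomer spectra
$\mathcal{C}_\Sigma = \{\mathcal{C}_x : x \in \Sigma\}$, find a string $s'$ such that $\mathcal{C}_x(s') \subseteq \mathcal{C}_x$
for all $x \in \Sigma$ and $d_E(s, s')$ is minimized.

Definition 16 (The edit-distance version of the SNP-MS_Q problem) Given a
string $s$ and a collection of compomer spectra
$\mathcal{C}_\Sigma = \{\mathcal{C}_x : x \in \Sigma\}$, find a string $s'$ such that $\mathcal{C}_x \subseteq \mathcal{C}_x(s')$
for all $x \in \Sigma$ and $d_E(s, s')$ is minimized.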

These extensions make it possible to detect not only
base substitutions but also base insertions and deletions.
Hence, they would permit mutation discovery in
DNA sequences (see [1]). In Additional file 1, we
show that the edit-distance versions of both SNP-MS_P and
SNP-MS_Q are NP-hard, and we give an exact dynamic
programming algorithm for solving the edit-distance version
of the SNP-MS_P problem.
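
For reference, the edit distance $d_E$ used in these definitions can be computed with the standard textbook dynamic program. The sketch below is that standard algorithm, not the algorithm of Additional file 1, and the function name `edit_distance` is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]

print(edit_distance("ATAAT", "ATAT"))
```

Unlike the Hamming distance, this measure is defined for strings of different lengths, which is what allows insertions and deletions to be detected.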

Conclusions 

To exploit the full potential of the SNP discovery 
approach using base-specific cleavage and mass spectrometry,
in this paper we have studied two new combinatorial
optimization problems, called SNP-MS_P and
SNP-MS_Q, respectively. We believe that any efficient
solution to either problem could offer a more seamless 
integration of information in four complementary base- 
specific reactions than previously done in [1,2], thereby 
improving the capability of the underlying biotechnology 
(i.e., base-specific cleavage and mass spectrometry) for 
sensitive and accurate SNP discovery. 

Although we cannot change the inherent complexity
of our proposed dynamic programming algorithm for
the SNP-MS_P problem, we believe that by improving
and optimizing its implementation, the
runtime can be significantly reduced to the extent suitable
for practical use. On the other hand, the NP-hardness
result indicates that in the most general situation,
solving the SNP-MS_Q problem exactly in polynomial
time is impossible unless P = NP. In more realistic
situations where only a very few SNPs (e.g., two or three
SNPs) occur in a target sample sequence, however, the
problem can be tackled quite easily, e.g., using an
exhaustive search approach. In future work, we shall
try to prove that the SNP-MS_P problem is NP-hard
and develop an efficient heuristic algorithm for the
SNP-MS_Q problem for practical use.

Additional material 



Additional file 1: Extensions to edit distance. The analysis results for
the edit-distance versions of the problems SNP-MS_P and SNP-MS_Q are presented. See
"Additional file 1.pdf".